Archive-name: comp-speech-faq Last-modified: 1993/06/02 comp.speech Frequently Asked Questions ========================== This document is an attempt to answer commonly asked questions and to reduce the bandwidth taken up by these posts and their associated replies. If you have a question, please check this file before you post. The FAQ is not meant to discuss any topic exhaustively. It will hopefully provide readers with pointers on where to find useful information. It also tries to list useful material available elsewhere on the net. This FAQ is posted monthly to comp.speech, comp.answers and news.answers. It is also available for anonymous ftp from the comp.speech archive site svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ If you have not already read the Usenet introductory material posted to "news.announce.newusers", please do. For help with FTP (file transfer protocol) look for a regular posting of "Anonymous FTP List - FAQ" in comp.misc, comp.archives.admin and news.answers amongst others. Admin ----- There are still some unanswered questions in this posting, and some answers are not particularly comprehensive. If you have any comments, suggestions for inclusions, or answers then please post them or email. (A section on Speaker Recognition/Verification would be good). This month there is information on two more speech analysis environments. Thanks to the people who contributed this information. Andrew Hunt Speech Technology Research Group email: andrewh@ee.su.oz.au Department of Electrical Engineering Ph: 61-2-692 4509 University of Sydney, NSW, Australia. Fax: 61-2-692 3847 ========================== Acknowledgements =========================== Thanks to the following for their significant comments and contributions. Barry Arons Joe Campbell Oliver Jakobs Sonja Kowalewski Tony Robinson Mike S[?] Many others have provided useful information. Thanks to all. 
============================ Contents =================================

PART 1 - General

    Q1.1: What is comp.speech?
    Q1.2: Where are the comp.speech archives?
    Q1.3: Common abbreviations and jargon.
    Q1.4: What are related newsgroups and mailing lists?
    Q1.5: What are related journals and conferences?
    Q1.6: What speech data is available?
    Q1.7: Speech File Formats, Conversion and Playing.
    Q1.8: What "Speech Laboratory Environments" are available?

PART 2 - Signal Processing for Speech

    Q2.1: What speech sampling and signal processing hardware can I use?
    Q2.2: What signal processing techniques are for speech technology?
    Q2.3: How do I find the pitch of a speech signal?
    Q2.4: How do I convert to/from mu-law format?

PART 3 - Speech Coding and Compression

    Q3.1: Speech compression techniques.
    Q3.2: What are some good references/books on coding/compression?
    Q3.3: What software is available?

PART 4 - Speech Synthesis

    Q4.1: What is speech synthesis?
    Q4.2: How can speech synthesis be performed?
    Q4.3: What are some good references/books on synthesis?
    Q4.4: What software/hardware is available?

PART 5 - Speech Recognition

    Q5.1: What is speech recognition?
    Q5.2: How can I build a very simple speech recogniser?
    Q5.2: What does speaker dependent/adaptive/independent mean?
    Q5.3: What does small/medium/large/very-large vocabulary mean?
    Q5.4: What does continuous speech or isolated-word mean?
    Q5.5: How is speech recognition done?
    Q5.6: What are some good references/books on recognition?
    Q5.7: What speech recognition packages are available?

PART 6 - Natural Language Processing

    Q6.1: What are some good references/books on NLP?
    Q6.2: What NLP software is available?

=======================================================================

PART 1 - General

Q1.1: What is comp.speech?

comp.speech is a newsgroup for discussion of speech technology and speech science. It covers a wide range of issues from application of speech technology, to research, to products and lots more.
By nature speech technology is an inter-disciplinary field and the newsgroup reflects this. However, computer application is the basic theme of the group. The following is a list of topics but does not cover all matters related to the field - no order of importance is implied. [1] Speech Recognition - discussion of methodologies, training, techniques, results and applications. This should cover the application of techniques including HMMs, neural-nets and so on to the field. [2] Speech Synthesis - discussion concerning theoretical and practical issues associated with the design of speech synthesis systems. [3] Speech Coding and Compression - both research and application matters. [4] Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues which are relevant to speech technology applications. Could cover parsing, natural language processing, phonology and prosodic work. [5] Speech System Design - issues relating to the application of speech technology to real-world problems. Includes the design of user interfaces, the building of real-time systems and so on. [6] Other matters - relevant conferences, books, public domain software, hardware and related products. ------------------------------------------------------------------------ Q1.2: Where are the comp.speech archives? comp.speech is being archived for anonymous ftp. ftp site: svr-ftp.eng.cam.ac.uk (or 129.169.24.20). directory: comp.speech/archive comp.speech/archive contains the articles as they arrive. Batches of 100 articles are grouped into a shar file, along with an associated file of Subject lines. Other useful information is also available in comp.speech/info. ------------------------------------------------------------------------ Q1.3: Common abbreviations and jargon. ANN - Artificial Neural Network. ASR - Automatic Speech Recognition. ASSP - Acoustics Speech and Signal Processing AVIOS - American Voice I/O Society CELP - Code-book excited linear prediction. 
COLING - Computational Linguistics
DTW    - Dynamic time warping.
FAQ    - Frequently asked questions.
HMM    - Hidden Markov model.
IEEE   - Institute of Electrical and Electronics Engineers
JASA   - Journal of the Acoustical Society of America
LPC    - Linear predictive coding.
LVQ    - Learning vector quantisation.
NLP    - Natural Language Processing.
NN     - Neural Network.
TTS    - Text-To-Speech (i.e. synthesis).
VQ     - Vector Quantisation.

------------------------------------------------------------------------

Q1.4: What are related newsgroups and mailing lists?

NEWSGROUPS

comp.ai - Artificial Intelligence newsgroup.
    Postings on general AI issues, language processing and AI
    techniques. Has a good FAQ including NLP, NN and other AI
    information.

comp.ai.nat-lang - Natural Language Processing Group
    Postings regarding Natural Language Processing. Set up to cover
    a broad range of related issues and different viewpoints.

comp.ai.nlang-know-rep - Natural Language Knowledge Representation
    Moderated group covering Natural Language.

comp.ai.neural-nets - discussion of Neural Networks and related issues.
    There are often postings on speech related matters - phonetic
    recognition, connectionist grammars and so on.

comp.compression - occasional articles on compression of speech.
    FAQ for comp.compression has some info on audio compression
    standards.

comp.dcom.telecom - Telecommunications newsgroup.
    Has occasional articles on voice products.

comp.dsp - discussion of signal processing - hardware and algorithms
    and more. Has a good FAQ posting. Has a regular posting of a
    comprehensive list of Audio File Formats.

comp.multimedia - Multi-Media discussion group.
    Has occasional articles on voice I/O.

sci.lang - Language.
    Discussion about phonetics, phonology, grammar, etymology and
    lots more.

alt.sci.physics.acoustics - some discussion of speech production &
    perception.

alt.binaries.sounds.misc - posting of various sound samples

alt.binaries.sounds.d - discussion about sound samples, recording
    and playback.
MAILING LISTS

ECTL - Electronic Communal Temporal Lobe
    Founder & Moderator: David Leip
    Moderated mailing list for researchers with interests in computer
    speech interfaces. This list serves a broad community including
    persons from signal processing, AI, linguistics and human factors.

    To subscribe, send the following information to:
        ectl-request@snowhite.cis.uoguelph.ca
        name, institute, department, daytime phone & e-mail address

    To access the archive, ftp snowhite.cis.uoguelph.ca, login as
    anonymous, and supply your local userid as a password. All the
    ECTL things can be found in pub/ectl.

Prosody Mailing List
    Unmoderated mailing list for discussion of prosody. The aim is
    to facilitate the spread of information relating to the research
    of prosody by creating a network of researchers in the field.
    If you want to participate, send the following one-line message
    to "listserv@purccvm.bitnet" :-

        subscribe prosody Your Name

foNETiks
    A monthly newsletter distributed by e-mail. It carries job
    advertisements, notices of conferences, and other news of
    general interest to phoneticians, speech scientists and others.
    The current editors are Linda Shockey and Gerry Docherty.
    To subscribe, send a message to FONETIKS-REQUEST@dev.rdg.ac.uk.

Digital Mobile Radio
    Covers many areas, including some speech topics such as speech
    coding and speech compression.
    Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.

------------------------------------------------------------------------

Q1.5: What are related journals and conferences?

Try the following commercially oriented magazines...

    Speech Technology - no longer published

Try the following technical journals...

    IEEE Transactions on Speech and Audio Processing (from Jan 93)
    Computational Linguistics (COLING)
    Computer Speech and Language
    Journal of the Acoustical Society of America (JASA)
    Transactions of IEEE ASSP
    AVIOS Journal

Try the following conferences...

    ICASSP Intl.
Conference on Acoustics Speech and Signal Processing (IEEE) ICSLP Intl. Conference on Spoken Language Processing EUROSPEECH European Conference on Speech Communication and Technology AVIOS American Voice I/O Society Conference SST Australian Speech Science and Technology Conference ------------------------------------------------------------------------ Q1.6: What speech data is available? A wide range of speech databases have been collected. These databases are primarily for the development of speech synthesis/recognition and for linguistic research. Some databases are free but most appear to be available for a small cost. The databases normally require lots of storage space - do not expect to be able to ftp all the data you want. [There are too many to list here in detail - perhaps someone would like to set up a special posting on speech databases?] PHONEMIC SAMPLES ================ First, some basic data. The following sites have samples of English phonemes (American accent I believe) in Sun audio format files. See Question 1.7 for information on audio file formats. 
sounds.sdsu.edu:/.1/phonemes phloem.uoregon.edu:/pub/Sun4/lib/phonemes sunsite.unc.edu:/pub/multimedia/sun-sounds/phonemes HOMOPHONE LIST ============== A list of homophones in General American English is available by anonymous FTP from the comp.speech archive site: machine name: svr-ftp.eng.cam.ac.uk directory: comp.speech/data file name: homophones-1.01.txt LINGUISTIC DATA CONSORTIUM (LDC) ================================ Information about the Linguistic Data Consortium is available via anonymous ftp from: ftp.cis.upenn.edu (130.91.6.8) in the directory: /pub/ldc Here are some excerpts from the files in that directory: Briefly stated, the LDC has been established to broaden the collection and distribution of speech and natural language data bases for the purposes of research and technology development in automatic speech recognition, natural language processing and other areas where large amounts of linguistic data are needed. Here is the brief list of corpora: * The TIMIT and NTIMIT speech corpora * The Resource Management speech corpus (RM1, RM2) * The Air Travel Information System (ATIS0) speech corpus * The Association for Computational Linguistics - Data Collection Initiative text corpus (ACL-DCI) * The TI Connected Digits speech corpus (TIDIGITS) * The TI 46-word Isolated Word speech corpus (TI-46) * The Road Rally conversational speech corpora (including "Stonehenge" and "Waterloo" corpora) * The Tipster Information Retrieval Test Collection * The Switchboard speech corpus ("Credit Card" excerpts and portions of the complete Switchboard collection) Further resources to be made available within the first year (or two): * The Machine-Readable Spoken English speech corpus (MARSEC) * The Edinburgh Map Task speech corpus * The Message Understanding Conference (MUC) text corpus of FBI terrorist reports * The Continuous Speech Recognition - Wall Street Journal speech corpus (WSJ-CSR) * The Penn Treebank parsed/tagged text corpus * The Multi-site ATIS speech 
corpus (ATIS2)
  * The Air Traffic Control (ATC) speech corpus
  * The Hansard English/French parallel text corpus
  * The European Corpus Initiative multi-language text corpus (ECI)
  * The Int'l Labor Organization/Int'l Trade Union multi-language
    text corpus (ILO/ITU)
  * Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)

The files in the directory include more detailed information on the individual databases. For further information contact

    Elizabeth Hodas
    441 Williams Hall
    University of Pennsylvania
    Philadelphia, PA 19104-6305
    Phone: (215) 898-0464
    Fax: (215) 573-2175
    e-mail: ehodas@walnut.ling.upenn.edu

Center for Spoken Language Understanding (CSLU)
===============================================

1. The ISOLET speech database of spoken letters of the English alphabet. The speech is high quality (16 kHz with a noise cancelling microphone). 150 speakers x 26 letters of the English alphabet twice in random order. The "ISOLET" data base can be purchased for $100 by sending an email request to vincew@cse.ogi.edu. (This covers handling, shipping and medium costs). The data base comes with a technical report describing the data.

2. CSLU has a telephone speech corpus of 1000 English alphabets. Callers recite the alphabet with brief pauses between letters. This database is available to not-for-profit institutions for $100. The data base is described in the proceedings of the International Conference on Spoken Language Processing. Contact vincew@cse.ogi.edu if interested.

PhonDat - A Large Database of Spoken German
===========================================

The PhonDat continuous speech corpora are now available on CD-ROM media (ISO 9660 format).

    PhonDat I  (Diphone Corpus)        : 6 CDs (1140.- DM)
    PhonDat II (Train Enquiries Corpus): 1 CD  ( 190.- DM)

PhonDat I comprises approx. 20,000 and PhonDat II approx. 1500 signal files in high quality 16-bit 16 kHz recordings.
The corpora come with a documentation containing the orthographic transcription and a citation form of the utterances, as well as a detailed file format description. A narrow phonetic transcription is available for selected files from corpus I and II. For information and orders contact Barbara Eisen Institut fuer Phonetik Schellingstr. 3 / II D 8000 Munich 40 Tel: +49 / 89 / 2180 -2454 or -2758 Fax: +49 / 89 / 280 03 62 ------------------------------------------------------------------------ Q1.7: Speech File Formats, Conversion and Playing. Section 2 of this FAQ has information on mu-law coding. A very good and very comprehensive list of audio file formats is prepared by Guido van Rossum. The list is posted regularly to comp.dsp and alt.binaries.sounds.misc, amongst others. It includes information on sampling rates, hardware, compression techniques, file format definitions, format conversion, standards, programming hints and lots more. It is much too long to include within this posting. It is also available by ftp from: ftp.cwi.nl directory: /pub file: AudioFormats ------------------------------------------------------------------------ Q1.8: What "Speech Laboratory Environments" are available? First, what is a Speech Laboratory Environment? A speech lab is a software package which provides the capability of recording, playing, analysing, processing, displaying and storing speech. Your computer will require audio input/output capability. The different packages vary greatly in features and capability - best to know what you want before you start looking around. Most general purpose audio processing packages will be able to process speech but do not necessarily have some specialised capabilities for speech (e.g. formant analysis). The following article provides a good survey. Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation" Journal of Speech and Hearing Research, pp 314-332, April 1992. 
Package: Entropic Signal Processing System (ESPS) and Waves Platform: Range of Unix platforms. Description: ESPS is a very comprehensive set of speech analysis/processing tools for the UNIX environment. The package includes UNIX commands, and a comprehensive C library (which can be accessed from other languages). Waves is a graphical front-end for speech processing. Speech waveforms, spectrograms, pitch traces etc can be displayed, edited and processed in X windows and Openwindows (versions 2 & 3). The HTK (Hidden Markov Model Toolkit) is now available from Entropic. HTK is described in some detail in Section 5 of this FAQ - the section on Speech Recognition. Cost: On request. Contact: Entropic Research Laboratory, Washington Research Laboratory, 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003 (202) 547-1420. email - info@wrl.epi.com Package: CSRE: Canadian Speech Research Environment Platform: IBM/AT-compatibles Description: CSRE is a comprehensive, microcomputer-based system designed to support speech research. CSRE provides a powerful, low-cost facility in support of speech research, using mass-produced and widely-available hardware. The project is non-profit, and relies on the cooperation of researchers at a number of institutions and fees generated when the software is distributed. Functions include speech capture, editing, and replay; several alternative spectral analysis procedures, with color and surface/3D displays; parameter extraction/tracking and tools to automate measurement and support data logging; alternative pitch-extraction systems; parametric speech (KLATT80) and non-speech acoustic synthesis, with a variety of supporting productivity tools; and a comprehensive experiment generator, to support behavioral testing using a variety of common testing protocols. A paper about the whole package can be found in: Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc. of the Second Intl. Conf. 
on Spoken Language Processing, Edmonton: University of Alberta, pp. 1127-1130.
Hardware: Can use a range of data acquisition/DSP
Cost: Distributed on a cost recovery basis.
Availability: For more information on availability contact Krystyna Marciniak - email march@uwovax.uwo.ca Tel (519) 661-3901 Fax (519) 661-3805. For technical information - email ramji@uwovax.uwo.ca
Note: Also included in Q4.4 on speech synthesis packages.

Package: Signalyze 2.0 from InfoSignal
Platform: Macintosh
Description: Signalyze's basic conception revolves around up to 100 signals, displayed synchronously in HyperCard fashion on "cards". The program offers a full complement of signal editing features, quite a few spectral analysis tools, manual scoring tools, pitch extraction routines, a good set of signal manipulation tools, and extensive input-output capacity. Handles multiple file formats: Signalyze, MacSpeech Lab, AudioMedia, SoundDesigner II, SoundEdit/MacRecorder, SoundWave, three sound resource formats, and ASCII-text.
Sound I/O: Direct sound input from MacRecorder and similar devices, AudioMedia, AudioMedia II and AD IN, some MacADIOS boards and devices, Apple sound input (built-in microphone). Sound output via Macintosh internal sound, some MacADIOS boards and devices as well as via the Digidesign 16-bit boards.
Compatibility: MacPlus and higher (including II, IIx, IIcx, IIci, IIfx, IIvx, IIvi, Portable, all PowerBooks, Centris and Quadras). Takes advantage of large and multiple screens and 16/256 color/grayscales. System 7.0 compatible. Runs in background with adjustable priority.
Misc: A demo is available upon request. Manuals and tutorial included. It is available in English, French, and German.
Cost: Individual licence US$350, site licence US$500, plus shipping.
Contact: North America - Network Technology Corporation, 91 Baldwin St., Charlestown MA 02129 Fax: 617-241-5064 Phone: 617-241-9205
Elsewhere - InfoSignal Inc. C.P.
73, 1015 LAUSANNE, Switzerland, FAX: +41 21 691-1372, Email: 76357.1213@COMPUSERVE.COM.

Package: Kay Elemetrics CSL (Computer Speech Lab) 4300
Platform: Minimum IBM PC-AT compatible with extended memory (min 2MB) with at least VGA graphics. Optimal would be a 386 or 486 machine with more RAM for handling larger amounts of data.
Description: Speech analysis package, with optional separate LPC program for analysis/synthesis. Uses its own file format for data, but has some ability to export data as ASCII. The main editing/analysis program (but not the LPC part) has its own macro language, making it easy to perform repetitive tasks. Probably not much use without the extra LPC program, which also allows manipulation of pitch, formant and bandwidth parameters. Hardware includes an internal DSP board for the PC (requires ISA slot), and an external module containing signal processing chips which does A/D and D/A conversion. A speaker and microphone are supplied.
Misc: A programmer's kit is available for programming the signal processing chips (experts only). Manuals included.
Cost: Recently approx 6000 pounds sterling. (Less in USA?)
Availability: UK distributors are Wessex Electronics, 114-116 North Street, Downend, Bristol, B16 5SE Tel: 0272 571404. In USA: Kay Elemetrics Corp, 12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798 Tel: (201) 227-7760

Package: Ptolemy
Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
Description: Ptolemy provides a highly flexible foundation for the specification, simulation, and rapid prototyping of systems. It is an object oriented framework within which diverse models of computation can co-exist and interact. Ptolemy can be used to model entire systems. Ptolemy has been used for a broad range of applications including signal processing, telecommunications, parallel processing, wireless communications, network design, radio astronomy, real time systems, and hardware/software co-design.
Ptolemy has also been used as a lab for signal processing and communications courses. Ptolemy has been developed at UC Berkeley over the past 3 years. Further information, including papers and the complete release notes, is available from the FTP site.
Cost: Free
Availability: The source code, binaries, and documentation are available by anonymous ftp from "ptolemy.berkeley.edu" - see the README file - ptolemy.berkeley.edu:/pub/README

Package: Khoros
Description: Public domain image processing package with a basic DSP library. Not particularly applicable to speech, but not bad for the price.
Cost: FREE
Availability: By anonymous ftp from pprg.eece.unm.edu

Can anyone provide information on capability and availability of the following packages?

    VIEW
    ILS ("Interactive Laboratory System")
    MacSpeech Lab (for Mac)
    SpeechViewer (PC)

=======================================================================

PART 2 - Signal Processing for Speech

Q2.1: What speech sampling and signal processing hardware can I use?

In addition to the following information, have a look at the Audio File format document prepared by Guido van Rossum (see details in Section 1.7).

Product: Sun standard audio port (SPARC 1 & 2)
Input: 1 channel, 8 bit mu-law encoded (telephone quality)
Output: 1 channel, 8 bit mu-law encoded (telephone quality)

Product: Ariel
Platform: Sun + others?
Input: 2 channels, 16bit linear, sample rate 8-96kHz (inc 32, 44.1, 48kHz).
Output: 2 channels, 16bit linear, sample rate 8-50kHz (inc 32, 44.1, 48kHz).
Contact: Ariel Corp., 433 River Road, Highland Park, NJ 08904. Ph: 908-249-2900 Fax: 908-249-2123 DSP BBS: 908-249-2124

Product: IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
Description: The card supports PCM, Mu-Law, A-Law and ADPCM at 44.1kHz (& 22.05, 11.025, 8kHz) with 16-bits of resolution in stereo. The card has a built-in DSP (don't know which one). The device also supports various formats for the output data, like big-endian, twos complement, etc.
Good noise immunity. The card is used for IBM's VoiceServer (they use the DSP for speech recognition). Apparently, the IBM VoiceServer has a speaker-independent vocabulary of over 20,000 words and each ACPA can support two independent sessions at once.
Cost: $US495
Contact: ?

Product: Sound Galaxy NX, Aztech Systems
Platform: PC - DOS, Windows 3.1
Cost: ??
Input: 8bit linear, 4-22 kHz.
Output: 8bit linear, 4-44.1 kHz
Misc: 11-voice FM Music Synthesizer YM3812; Built-in power amplifier; DSP signal processing support - ST70019SB Hardware ADPCM decompression (2:1,3:1,4:1) Full "AdLib" and "Sound Blaster" compatibility. Software includes a simple Text-to-Speech program "Monologue".

Product: Sound Galaxy NX PRO, Aztech Systems
Platform: PC - DOS, Windows 3.1
Cost: ??
Input: 2 * 8bit linear, 4-22.05 kHz (stereo), 4-44.1 kHz (mono).
Output: 2 * 8bit linear, 4-44.1 kHz (stereo/mono)
Misc: 20-voice FM Music Synthesizer; Built-in power amplifier; Stereo Digital/Analog Mixer; Configuration in EEPROM. Hardware ADPCM decompression (2:1,3:1,4:1). Includes DSP signal processing support. Full "AdLib" and "Sound Blaster Pro II" compatibility. Software includes a simple Text-to-Speech program "Monologue" and Sampling laboratory for Windows 3.1: WinDAT.
Contact: USA (510) 623-8988

Other PC Sound Cards

=================================================================================
sound           stereo/mono              compatible     included       voices
card            & sample rate            with           ports
=================================================================================
Adlib Gold      stereo:  8-bit 44.1khz   Adlib ?        audio in/out,  20 (opl3)
1000                    16-bit 44.1khz                  mic in,        +2 digital
                mono:    8-bit 44.1khz                  joystick,      channels
                        16-bit 44.1khz                  MIDI

Sound Blaster   mono:    8-bit 22.1khz   Adlib          audio in/out,  11 synth.
                                         FM synth with  joystick
                                         2 operators

Sound Blaster   stereo:  8-bit 22.05khz  Adlib          audio in/out,  22
Pro Basic       mono:    8-bit 44.1khz   Sound Blaster  joystick

Sound Blaster   stereo:  8-bit 22.05khz  Adlib          audio in/out,  11
Pro             mono:    8-bit 44.1khz   Sound Blaster  joystick,
                                                        MIDI, SCSI

Sound Blaster   stereo:  8-bit 4-44.1khz Sound Blaster  audio in/out,  20
16 ASP          stereo: 16-bit 4-44.1khz                joystick,
                                                        MIDI

Audio Port      mono:    8-bit 22.05khz  Adlib          audio in/out,  11
                                         Sound Blaster  joystick

Pro Audio       stereo:  8-bit 44.1khz   Adlib          audio in/out,  20
Spectrum +                               Pro Audio      joystick
                                         Spectrum

Pro Audio       stereo: 16-bit 44.1khz   Adlib          audio in/out,  20
Spectrum 16                              Pro Audio      joystick,
                                         Spectrum       MIDI, SCSI
                                         Sound Blaster

Thunder Board   stereo:  8-bit 22khz     Adlib          audio in/out,  11
                                         Sound Blaster  joystick

Gravis          stereo:  8-bit 44.1khz   Adlib,         audio line     32 sampled
Ultrasound      mono:    8-bit 44.1khz   Sound Blaster  in/out,        32 synth.
                (w/16-bit daughtercard)                 amplified out,
                stereo: 16-bit 44.1khz                  mic in, CD
                mono:   16-bit 44.1khz                  audio in,
                                                        daughterboard
                                                        ports (for SCSI
                                                        and 16-bit)

MultiSound      stereo: 16-bit 44.1kHz   Nothing        audio in/out,  32 sampled
                64x oversampling                        joystick,
                                                        MIDI
=================================================================================

Can anyone provide information on Mac, NeXT and other hardware?

[Help is needed to source more info. How about the following format?]

Product: xxx
Platform: PC, Mac, Sun, ...
Rough Cost (pref $US):
Input: e.g. 16bit linear, 8,10,16,32kHz.
Output: e.g. 16bit linear, 8,10,16,32kHz.
DSP: signal processing support
Other:
Contact:

------------------------------------------------------------------------

Q2.2: What signal processing techniques are for speech technology?

This question is far too big to be answered in a FAQ posting. Fortunately there are many good books which answer the question!

Some good introductory books include

    Digital processing of speech signals; L. R. Rabiner, R. W. Schafer.
    Englewood Cliffs; London: Prentice-Hall, 1978

    Voice and Speech Processing; T. W. Parsons.
New York; McGraw Hill 1986 Computer Speech Processing; ed Frank Fallside, William A. Woods Englewood Cliffs: Prentice-Hall, c1985 Digital speech processing : speech coding, synthesis, and recognition edited by A. Nejat Ince; Kluwer Academic Publishers, Boston, c1992 Speech science and technology; edited by Shuzo Saito pub. Ohmsha, Tokyo, c1992 Speech analysis; edited by Ronald W. Schafer, John D. Markel New York, IEEE Press, c1979 Douglas O'Shaughnessy -- Speech Communication: Human and Machine Addison Wesley series in Electrical Engineering: Digital Signal Processing, 1987. ------------------------------------------------------------------------ Q2.3: How do I find the pitch of a speech signal? This topic comes up regularly in the comp.dsp newsgroup. Question 2.5 of the FAQ posting for comp.dsp gives a comprehensive list of references on the definition, perception and processing of pitch. ------------------------------------------------------------------------ Q2.4: How do I convert to/from mu-law format? Mu-law coding is a form of compression for audio signals including speech. It is widely used in the telecommunications field because it improves the signal-to-noise ratio without increasing the amount of data. Typically, mu-law compressed speech is carried in 8-bit samples. It is a companding technqiue. That means that carries more information about the smaller signals than about larger signals. Mu-law coding is provided as standard for the audio input and output of the SUN Sparc stations 1&2 (Sparc 10's are linear). On SUN Sparc systems have a look in the directory /usr/demo/SOUND. Included are table lookup macros for ulaw conversions. [Note however that not all systems will have /usr/demo/SOUND installed as it is optional - see your system admin if it is missing.] OR, here is some sample conversion code in C. # include unsigned char linear2ulaw(/* int */); int ulaw2linear(/* unsigned char */); /* ** This routine converts from linear to ulaw. 
** ** Craig Reese: IDA/Supercomputing Research Center ** Joe Campbell: Department of Defense ** 29 September 1989 ** ** References: ** 1) CCITT Recommendation G.711 (very difficult to follow) ** 2) "A New Digital Technique for Implementation of Any ** Continuous PCM Companding Law," Villeret, Michel, ** et al. 1973 IEEE Int. Conf. on Communications, Vol 1, ** 1973, pg. 11.12-11.17 ** 3) MIL-STD-188-113,"Interoperability and Performance Standards ** for Analog-to_Digital Conversion Techniques," ** 17 February 1987 ** ** Input: Signed 16 bit linear sample ** Output: 8 bit ulaw sample */ #define ZEROTRAP /* turn on the trap as per the MIL-STD */ #undef ZEROTRAP #define BIAS 0x84 /* define the add-in bias for 16 bit samples */ #define CLIP 32635 unsigned char linear2ulaw(sample) int sample; { static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4, 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5, 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5, 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6, 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6, 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6, 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7}; int sign, exponent, mantissa; unsigned char ulawbyte; /* Get the sample into sign-magnitude. */ sign = (sample >> 8) & 0x80; /* set aside the sign */ if(sign != 0) sample = -sample; /* get magnitude */ if(sample > CLIP) sample = CLIP; /* clip the magnitude */ /* Convert from 16 bit linear to ulaw. */ sample = sample + BIAS; exponent = exp_lut[( sample >> 7 ) & 0xFF]; mantissa = (sample >> (exponent + 3)) & 0x0F; ulawbyte = ~(sign | (exponent << 4) | mantissa); #ifdef ZEROTRAP if (ulawbyte == 0) ulawbyte = 0x02; /* optional CCITT trap */ #endif return(ulawbyte); } /* ** This routine converts from ulaw to 16 bit linear. 
**
** Craig Reese: IDA/Supercomputing Research Center
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) MIL-STD-188-113,"Interoperability and Performance Standards
**     for Analog-to_Digital Conversion Techniques,"
**     17 February 1987
**
** Input: 8 bit ulaw sample
** Output: signed 16 bit linear sample
*/

int
ulaw2linear(ulawbyte)
unsigned char ulawbyte;
{
    static int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
    int sign, exponent, mantissa, sample;

    ulawbyte = ~ulawbyte;
    sign = (ulawbyte & 0x80);
    exponent = (ulawbyte >> 4) & 0x07;
    mantissa = ulawbyte & 0x0F;
    sample = exp_lut[exponent] + (mantissa << (exponent + 3));
    if(sign != 0) sample = -sample;

    return(sample);
}

=======================================================================

PART 3 - Speech Coding and Compression

Q3.1: Speech compression techniques.

Can anyone provide a 1-2 page summary on speech compression? Topics to
cover might include common techniques, where speech compression might
be used and perhaps something on why speech is difficult to compress.

[The FAQ for comp.compression includes a few questions and answers
on the compression of speech.]

------------------------------------------------------------------------

Q3.2: What are some good references/books on coding/compression?

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
      Addison Wesley series in Electrical Engineering: Digital Signal
      Processing, 1987.

------------------------------------------------------------------------

Q3.3: What software is available?

Note: there are two types of speech compression technique referred to
below. Lossless techniques reproduce the original speech exactly after
a compression-decompression cycle. Lossy techniques do not preserve the
speech perfectly. As a general rule, the more you compress speech, the
more the quality degrades.
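
To make the lossy case concrete: the mu-law coding of Q2.4 is itself a
lossy coder, since 16 bit samples are squeezed into 8 bits. The sketch
below is not part of the original posting. It is a minimal ANSI C
restatement of the Q2.4 arithmetic (same bias and clip values) that
measures how far a sample moves after an encode/decode round trip. The
names encode_ulaw, decode_ulaw and max_roundtrip_error are made up for
this illustration, the exponent is found with a loop rather than the
256-entry table, and an int of at least 32 bits is assumed.

```c
#include <assert.h>
#include <stdlib.h>

#define ULAW_BIAS 0x84   /* same add-in bias as the Q2.4 code */
#define ULAW_CLIP 32635  /* same clip level as the Q2.4 code */

/* Encode one signed 16 bit linear sample as an 8 bit mu-law byte.
   Assumes int is wider than 16 bits so that -(-32768) is representable. */
unsigned char encode_ulaw(int sample)
{
    int sign = (sample < 0) ? 0x80 : 0x00;
    int magnitude = (sample < 0) ? -sample : sample;
    int exponent = 7;
    int mask;
    int mantissa;

    if (magnitude > ULAW_CLIP)
        magnitude = ULAW_CLIP;
    magnitude += ULAW_BIAS;

    /* Search bits 14..8 for the highest set bit instead of using the
       256-entry lookup table; exponent 0 is the fall-through case. */
    for (mask = 0x4000; exponent > 0 && (magnitude & mask) == 0; mask >>= 1)
        exponent--;
    assert(exponent >= 0 && exponent <= 7);

    mantissa = (magnitude >> (exponent + 3)) & 0x0F;
    return (unsigned char)(~(sign | (exponent << 4) | mantissa) & 0xFF);
}

/* Decode an 8 bit mu-law byte back to a signed linear sample. */
int decode_ulaw(unsigned char ulawbyte)
{
    int inverted = ~ulawbyte & 0xFF;
    int sign = inverted & 0x80;
    int exponent = (inverted >> 4) & 0x07;
    int mantissa = inverted & 0x0F;
    /* (BIAS << exponent) - BIAS reproduces the 8-entry table of Q2.4. */
    int sample = ((ULAW_BIAS + (mantissa << 3)) << exponent) - ULAW_BIAS;

    return sign ? -sample : sample;
}

/* Largest absolute round-trip error over the samples lo..hi inclusive. */
int max_roundtrip_error(int lo, int hi)
{
    int worst = 0;
    int s;

    for (s = lo; s <= hi; s++) {
        int err = abs(decode_ulaw(encode_ulaw(s)) - s);
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Calling max_roundtrip_error(-100, 100) and then
max_roundtrip_error(-32768, 32767) shows the companding trade-off: the
error is only a few units for quiet samples and grows to several hundred
units near full scale. A lossless coder such as shorten would give zero
error over the whole range.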

Package:      shorten - a lossless compressor for speech signals
Platform:     UNIX/DOS
Description:  A lossless compressor for speech signals. It will compile
              and run on UNIX workstations and will cope with a wide
              variety of formats. Compression is typically 50% for 16bit
              clean speech sampled at 16kHz.
Availability: Anonymous ftp svr-ftp.eng.cam.ac.uk:
              /misc/shorten-0.4.shar


Package:      CELP 3.2 (U.S. Fed-Std-1016 compatible coder)
Platform:     Sun (the makefiles & source can be modified for other
              platforms)
Description:  CELP is a lossy compression technique. The U.S. DoD's
              Federal-Standard-1016 based 4800 bps code excited linear
              prediction voice coder version 3.2 (CELP 3.2) Fortran and
              C simulation source codes.
Contact:      Joe Campbell
Availability: Anonymous ftp to furmint.nectar.cs.cmu.edu (128.2.209.111):
              celp.audio.compression
              (C src in celp.audio.compression/celp32c).
              Thanks to Vince Cate for providing this site :-)

   The CELP release package is also available, at no charge, on DOS
   disks from:

      Bob Fenichel
      National Communications System, Washington, D.C. 20305, USA
      Ph: 1-703-692-2124   Fax: 1-703-746-4960

   The following documents are vital to successful real-time
   implementations and they are also available from Bob Fenichel
   (they're unavailable electronically):

      "Details to Assist in Implementation of Federal Standard 1016
      CELP," National Communications System, Office of Technology &
      Standards, 1992. Technical Information Bulletin 92-1.

      "Telecommunications: Analog-to-Digital Conversion of Radio Voice
      by 4,800 bit/second Code Excited Linear Prediction (CELP),"
      National Communications System, Office of Technology & Standards,
      1991. Federal Standard 1016.


Package:      32 kbps ADPCM
Platform:     SGI and Sun Sparcs
Description:  32 kbps ADPCM C-source code (G.721 compatibility is
              uncertain)
Contact:      Jack Jansen
Availability: Anonymous ftp to ftp.cwi.nl: pub/adpcm.shar


Package:      GSM 06.10
Platform:     Runs faster than real time on most Sun SPARCstations
Description:  GSM 06.10 is a lossy compression technique. European GSM
              06.10 provisional standard for full-rate speech
              transcoding, prI-ETS 300 036, which uses RPE/LTP (residual
              pulse excitation/long term prediction) coding at
              13 kbit/s.
Contact:      Carsten Bormann
Availability: An implementation can be ftp'ed from:
                 tub.cs.tu-berlin.de: /pub/tubmik/gsm-1.0.tar.Z
                                     +/pub/tubmik/gsm-1.0-patch1
              or as a faster but not always up-to-date alternative:
                 liasun3.epfl.ch: /pub/audio/gsm-1.0pl1.tar.Z


Package:      U.S.F.S. 1016 CELP vocoder for DSP56001
Platform:     DSP56001
Description:  Real-time U.S.F.S. 1016 CELP vocoder that runs on a single
              27MHz Motorola DSP56001. Free demo software available from
              PC-56 and PC-56D. Source and object code available for a
              one-time license fee.
Contact:      Cole Erskine
              Analogical Systems
              2916 Ramona St.
              Palo Alto, CA 94306, USA
              Tel:(415) 323-3232 FAX:(415) 323-4222
              Internet: cole@analogical.com

=======================================================================

PART 4 - Speech Synthesis

Q4.1: What is speech synthesis?

Speech synthesis is the task of transforming written input to spoken
output. The input can either be provided in a graphemic/orthographic or
a phonemic script, depending on its source.

------------------------------------------------------------------------

Q4.2: How can speech synthesis be performed?

There are several algorithms. The choice depends on the task they're
used for. The easiest way is to just record the voice of a person
speaking the desired phrases. This is useful if only a restricted volume
of phrases and sentences is used, e.g. messages in a train station, or
schedule information via phone. The quality depends on the way recording
is done.
More sophisticated but worse in quality are algorithms which split the
speech into smaller pieces. The smaller those units are, the fewer of
them are needed, but the quality also decreases. An often used unit is
the phoneme, the smallest linguistic unit. Depending on the language
used there are about 35-50 phonemes in western European languages, i.e.
there are 35-50 single recordings. The problem is combining them, as
fluent speech requires fluent transitions between the elements. The
intelligibility is therefore lower, but the memory required is small.

A solution to this dilemma is using diphones. Instead of splitting at
the transitions, the cut is done at the center of the phonemes, leaving
the transitions themselves intact. This gives about 400 elements (20*20)
and the quality increases.

The longer the units become, the more elements there are, but the
quality increases along with the memory required. Other units which are
widely used are half-syllables, syllables, words, or combinations of
them, e.g. word stems and inflectional endings.

------------------------------------------------------------------------

Q4.3: What are some good references/books on synthesis?

The following are good introductory books/articles.

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
      Addison Wesley series in Electrical Engineering: Digital Signal
      Processing, 1987.

   D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl.
      of the Acoustical Society of America (JASA), v82, Sept. 1987,
      pp 737-793.

   I. H. Witten. Principles of Computer Speech.
      (London: Academic Press, Inc., 1982).

   John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to
      Speech: The MITalk System", Cambridge University Press, 1987.

------------------------------------------------------------------------

Q4.4: What software/hardware is available?

There appears to be very little Public Domain or Shareware speech
synthesis related software available for FTP. However, the following
are available.
Strictly speaking, not all the following sources are speech synthesis -
all are speech output systems.


SIMTEL-20

The following is a list of speech related software available from
SIMTEL-20 and its mirror sites for PCs. The SIMTEL internet address is
WSMR-SIMTEL20.Army.Mil [192.88.110.20]. Try looking at your nearest
archive site first.

Directory PD1:
 Filename   Type Length   Date   Description
==============================================
AUTOTALK.ARC  B   23618  881216  Digitized speech for the PC
CVOICE.ARC    B   21335  891113  Tells time via voice response on PC
HEARTYPE.ARC  B   10112  880422  Hear what you are typing, crude voice synth.
HELPME2.ARC   B    8031  871130  Voice cries out 'Help Me!' from PC speaker
SAY.ARC       B   20224  860330  Computer Speech - using phonemes
SPEECH98.ZIP  B   41003  910628  Build speech (voice) on PC using 98 phonemes
TALK.ARC      B    8576  861109  BASIC program to demo talking on a PC speaker
TRAN.ARC      B   39766  890715  Repeats typed text in digital voice
VDIGIT.ZIP    B  196284  901223  Toolkit: Add digitized voice to your programs
VGREET.ARC    B   45281  900117  Voice says good morning/afternoon/evening


Package:      ORATOR Text-to-Speech Synthesizer
Platform:     SUN SPARC, Decstation 5000. Portable to other UNIX
              platforms.
Description:  Sophisticated speech synthesis package. Has text
              preprocessing (for abbreviations, numbers), acronym
              citation rules, and human-like spelling routines. High
              accuracy for pronunciation of names of people, places and
              businesses in America, text-to-speech translation for
              common words; rules for stress and intonation marking,
              based on natural-sounding demisyllable synthesis; various
              methods of user control and customization at most stages
              of processing. Currently, ORATOR is most appropriate for
              applications containing a large component of names in the
              text, and requires some amount of user-specified
              text-preprocessing to produce good quality speech for
              general text.
Hardware:     Standard audio output of SPARC, or Decstation audio
              hardware.
              At least 16M of memory recommended.
Cost:         Binary License: $5,000.
              Source license for porting or commercial use: $30,000.
Availability: Contact Bellcore's Licensing Office (1-800-527-1080)
              or email: jzilg@cc.bellcore.com (John Zilg)


Package:      Text to phoneme program (1)
Platform:     unknown
Description:  Text to phoneme program. Based on Naval Research Lab's
              set of text to phoneme rules.
Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the
              directory /pub/src/phon.tar.Z


Package:      Text to phoneme program (2)
Platform:     unknown
Description:  Text to phoneme program.
Availability: By FTP from "wuarchive.wustl.edu" in the file
              /mirrors/unix-c/utils/phoneme.c


Package:      "Speak" - a Text to Speech Program
Platform:     Sun SPARC
Description:  Text to speech program based on concatenation of
              pre-recorded speech segments. A function library can be
              used to integrate speech output into other code.
Hardware:     SPARC audio I/O
Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z


Package:      TheBigMouth - a Text to Speech Program
Platform:     NeXT
Description:  Text to speech program based on concatenation of
              pre-recorded speech segments. NeXT equivalent of "Speak"
              for Suns.
Availability: try NeXT archive sites such as sonata.cc.purdue.edu.


Package:      TextToSpeech Kit
Platform:     NeXT Computers
Description:  The TextToSpeech Kit does unrestricted conversion of
              English text to synthesized speech in real-time. The user
              has control over speaking rate, median pitch, stereo
              balance, volume, and intonation type. Text of any length
              can be spoken, and messages can be queued up, from
              multiple applications if desired. Real-time controls such
              as pause, continue, and erase are included.
              Pronunciations are derived primarily by dictionary
              look-up. The Main Dictionary has nearly 100,000
              hand-edited pronunciations which can be supplemented or
              overridden with the User and Application dictionaries.
              A number parser handles numbers in any form.
              A letter-to-sound knowledge base provides pronunciations
              for words not in the Main or customized dictionaries.
              Dictionary search order is under user control. Special
              modes of text input are available for spelling and
              emphasis of words or phrases. The actual conversion of
              text to speech is done by the TextToSpeech Server. The
              Server runs as an independent task in the background, and
              can handle up to 50 client connections.
Misc:         The TextToSpeech Kit comes in two packages: the Developer
              Kit and the User Kit. The Developer Kit enables developers
              to build and test applications which incorporate
              text-to-speech. It includes the TextToSpeech Server, the
              TextToSpeech Object, the pronunciation editor PrEditor,
              several example applications, phonetic fonts, example
              source code, and developer documentation. The User Kit
              provides support for applications which incorporate
              text-to-speech. It is a subset of the Developer Kit.
Hardware:     Uses standard NeXT Computer hardware.
Cost:         TextToSpeech User Kit: $175 CDN ($145 US)
              TextToSpeech Developer Kit: $350 CDN ($290 US)
              Upgrade from User to Developer Kit: $175 CDN ($145 US)
Availability: Trillium Sound Research
              1500, 112 - 4th Ave. S.W., Calgary, Alberta,
              Canada, T2P 0H3
              Tel: (403) 284-9278   Fax: (403) 282-6778
              Order Desk: 1-800-L-ORATOR (US and Canada only)
              Email: manzara@cpsc.UCalgary.CA


Package:      SENSYN speech synthesizer
Platform:     PC, Mac, Sun, and NeXt
Rough Cost:   $300
Description:  This formant synthesizer produces speech waveform files
              based on the (Klatt) KLSYN88 synthesizer. It is intended
              for laboratory and research use. Note that this is NOT a
              text-to-speech synthesizer, but creates speech sounds
              based upon a large number of input variables (formant
              frequencies, bandwidths, glottal pulse characteristics,
              etc.) and would be used as part of a TTS system. Includes
              full source code.
Availability: Sensimetrics Corporation, 64 Sidney Street, Cambridge MA
              02139. Fax: (617) 225-0470; Tel: (617) 225-2442.
              Email: sensimetrics@sens.com


Package:      SPCHSYN.EXE
Platform:     PC?
Availability: By anonymous ftp from evans.ee.adfa.oz.au (131.236.30.24)
              in /mirrors/tibbs/Applications/SPCHSYN.EXE
              It is a self extracting DOS archive.
Requirements: May require special TI product(s), but all source is
              there.


Package:      CSRE: Canadian Speech Research Environment
Platform:     PC
Cost:         Distributed on a cost recovery basis
Description:  CSRE is a software system which includes, in addition to
              the Klatt speech synthesizer, SPEECH ANALYSIS and
              EXPERIMENT CONTROL SYSTEM. A paper about the whole package
              can be found in:
                 Jamieson D.G. et al, "CSRE: A Speech Research
                 Environment", Proc. of the Second Intl. Conf. on Spoken
                 Language Processing, Edmonton: University of Alberta,
                 pp. 1127-1130.
Hardware:     Can use a range of data acquisition/DSP
Availability: For more information about the availability of this
              software contact Krystyna Marciniak -
              email march@uwovax.uwo.ca
              Tel (519) 661-3901 Fax (519) 661-3805.
              For technical information email ramji@uwovax.uwo.ca
Note:         A more detailed description is given in Q1.8 on speech
              environments.


Package:      JSRU
Platform:     UNIX and PC
Cost:         100 pounds sterling (from academic institutions and
              industry)
Description:  A C version of the JSRU system, Version 2.3 is available.
              It's written in Turbo C but runs on most Unix systems with
              very little modification. A Form of Agreement must be
              signed to say that the software is required for research
              and development only.
Contact:      Dr. E.Lewis (eric.lewis@uk.ac.bristol)


Package:      Klatt-style synthesiser
Platform:     Unix
Cost:         FREE
Description:  Software posted to comp.speech in late 1992.
Availability: By anonymous ftp from the comp.speech archives. Two files
              are available from the directory "comp.speech/sources".
              The files are "klatt-cti.tar.Z" and "klatt-jpi.tar.Z".
              The first is the original source, the second is a modified
              version.


Package:      MacinTalk
Platform:     Macintosh
Cost:         Free
Description:  Formant based speech synthesis.
              There is also a program called "tex-edit" which apparently
              can pronounce English sentences reasonably using
              Macintalk.
Availability: By anonymous ftp from many archive sites (have a look on
              archie if you can). tex-edit is on many of the same sites.
              Try
              wuarchive.wustl.edu:/mirrors2/info-mac/Old/card/macintalk.hqx[.Z]
                                                    /macintalk-stack.hqx[.Z]
              wuarchive.wustl.edu:/mirrors2/info-mac/app/tex-edit-15.hqx


Package:      Tinytalk
Platform:     PC
Description:  Shareware package is a speech 'screen reader' which is
              used by many blind users.
Availability: By anonymous ftp from handicap.shel.isc-br.com.
              Get the files /speech/ttexe145.zip & /speech/ttdoc145.zip.


Package:      Bliss
Contact:      Dr. John Mertus (Brown University)
              Mertus@browncog.bitnet


Package:      xxx
Platform:     (PC, Mac, Sun, NeXt etc)
Rough Cost:   (if appropriate)
Description:  (keep it brief)
Hardware:     (requirement list)
Availability: (ftp info, email contact or company contact)


Can anyone provide information on the following:

   Narrator (Amiga) - formant based synthesis
   speech synthesis chip sets?
   MultiVoice
   Monolog

Please email or post suitable information for this list. Commercial,
public domain and research packages are all appropriate.

[Perhaps someone would like to start a separate posting on this area.]

=======================================================================

PART 5 - Speech Recognition

Q5.1: What is speech recognition?

Automatic speech recognition is the process by which a computer maps an
acoustic speech signal to text.

Automatic speech understanding is the process by which a computer maps
an acoustic speech signal to some form of abstract meaning of the
speech.

------------------------------------------------------------------------

Q5.2: How can I build a very simple speech recogniser?

Doug Danforth provides a detailed account in article 253 in the
comp.speech archives - also available as file
info/DIY_Speech_Recognition.

The first part is reproduced here.

  QUICKY RECOGNIZER sketch:

  Here is a simple recognizer that should give you 85%+ recognition
  accuracy. The accuracy is a function of WHAT words you have in your
  vocabulary. Long distinct words are easy. Short similar words are
  hard. You can get 98+% on the digits with this recognizer.

  Overview:
  (1) Find the beginning and end of the utterance.
  (2) Filter the raw signal into frequency bands.
  (3) Cut the utterance into a fixed number of segments.
  (4) Average data for each band in each segment.
  (5) Store this pattern with its name.
  (6) Collect training set of about 3 repetitions of each pattern (word).
  (7) Recognize unknown by comparing its pattern against all patterns
      in the training set and returning the name of the pattern closest
      to the unknown.

  Many variations upon the theme can be made to improve the performance.
  Try different filtering of the raw signal and different processing
  methods.

------------------------------------------------------------------------

Q5.2: What does speaker dependent/adaptive/independent mean?

A speaker dependent system is developed (trained) to operate for a
single speaker. These systems are usually easier to develop, cheaper to
buy and more accurate, but are not as flexible as speaker adaptive or
speaker independent systems.

A speaker independent system is developed (trained) to operate for any
speaker or speakers of a particular type (e.g. male/female,
American/English). These systems are the most difficult to develop,
most expensive and currently accuracy is not as good. They are the most
flexible.

A speaker adaptive system is developed to adapt its operation for new
speakers that it encounters, usually based on a general model of speaker
characteristics. It lies somewhere between speaker independent and
speaker dependent systems.

Each type of system is suited to different applications and domains.

------------------------------------------------------------------------

Q5.3: What does small/medium/large/very-large vocabulary mean?

The size of vocabulary of a speech recognition system affects the
complexity, processing requirements and the accuracy of the system.
Some applications only require a few words (e.g. numbers only), others
require very large dictionaries (e.g. dictation machines).

There are no established definitions but the following may be a helpful
guide.

   small vocabulary      - tens of words
   medium vocabulary     - hundreds of words
   large vocabulary      - thousands of words
   very-large vocabulary - tens of thousands of words.

------------------------------------------------------------------------

Q5.4: What does continuous speech or isolated-word mean?

An isolated-word system operates on single words at a time - requiring
a pause between saying each word. This is the simplest form of
recognition to perform, because the pronunciation of the words tends
not to affect each other. Because the occurrences of each particular
word are similar they are easier to recognise.

A continuous speech system operates on speech in which words are
connected together, i.e. not separated by pauses. Continuous speech is
more difficult to handle because of a variety of effects. First, it is
difficult to find the start and end points of words. Another problem is
"coarticulation". The production of each phoneme is affected by the
production of surrounding phonemes, and similarly the start and end of
words are affected by the preceding and following words. The
recognition of continuous speech is also affected by the rate of speech
(fast speech tends to be harder).

------------------------------------------------------------------------

Q5.5: How is speech recognition done?

A wide variety of techniques are used to perform speech recognition.
There are many types of speech recognition. There are many levels of
speech recognition/processing/understanding.

Typically speech recognition starts with the digital sampling of speech.
The next stage would be acoustic signal processing.
Common techniques include a variety of spectral analyses, LPC analysis,
the cepstral transform, cochlea modelling and many, many more.

The next stage will typically try to recognise phonemes, groups of
phonemes or words. This stage can be achieved by many processes such as
DTW (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural
Networks), and sometimes expert systems. In crude terms, all these
processes try to recognise the patterns of speech. The most advanced
systems are statistically motivated.

Some systems utilise knowledge of grammar to help with the recognition
process.

Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
process the speech input.

Some systems try to "understand" speech. That is, they try to convert
the words into a representation of what the speaker intended to mean or
achieve by what they said.

------------------------------------------------------------------------

Q5.6: What are some good references/books on recognition?

Some general introduction books on speech recognition:

   Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang
      Juang; Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing
      Series), c1993; ISBN 0-13-015157-2

   Speech recognition by machine; W.A. Ainsworth
      London: Peregrinus for the Institution of Electrical Engineers,
      c1988

   Speech synthesis and recognition; J.N. Holmes
      Wokingham: Van Nostrand Reinhold, c1988

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
      Addison Wesley series in Electrical Engineering: Digital Signal
      Processing, 1987.

   Electronic speech recognition: techniques, technology and
      applications; edited by Geoff Bristow, London: Collins, 1986

   Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
      San Mateo: Morgan Kaufmann, c1990

More specific books/articles:

   Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki,
      M.A. Jack.
      Edinburgh: Edinburgh University Press, c1990

   Automatic speech recognition: the development of the SPHINX system;
      by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989

   Prosody and speech recognition; Alex Waibel
      (Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988

   S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to
      the Application of the Theory of Probabilistic Functions of a
      Markov Process to Automatic Speech Recognition" in Bell Syst.
      Tech. Jnl. v62(4), pp1035--1074, April 1983

   R. P. Lippmann, "Review of Neural Networks for Speech Recognition",
      in Neural Computation, v1(1), pp 1-38, 1989.

------------------------------------------------------------------------

Q5.7: What speech recognition packages are available?

Package Name: Votan
Platform:     MS-DOS, SCO UNIX
Description:  Isolated word and continuous speech modes, speaker
              dependent and (limited) speaker independent. Vocab size is
              255 words or up to a fixed memory limit - but it is
              possible to dynamically load different words for an
              effectively unlimited number of words.
Rough Cost:   Approx US $1,000-$1,500
Requirements: Cost includes one Votan Voice Recognition ISA-bus board
              for 386/486-based machines. A software development system
              is also available for DOS and Unix.
Misc:         Up to 8 Votan boards may co-exist for 8 simultaneous
              voice users. A telephone interface is also available.
              There is also a 4GL and a software development system.
              Apparently there is more than one version - more info
              required.
Contact:      800-877-4756, 510-426-5600


Package:      HTK (HMM Toolkit) - From Entropic
Platform:     Range of Unix platforms.
Description:  HTK is a software toolkit for building continuous density
              HMM based speech recognisers. It consists of a number of
              library modules and a number of tools. Functions include
              speech analysis, training tools, recognition tools,
              results analysis, and an interactive tool for speech
              labelling. Many standard forms of continuous density HMM
              are possible.
              Can perform isolated word or connected word speech
              recognition. It can model whole words or sub-word units.
              Can perform speaker verification and other pattern
              recognition work using HMMs. HTK is now integrated with
              the ESPS/Waves speech research environment which is
              described in Section 1.8 of this posting.
Misc:         The availability of HTK changed in early 1993 when
              Entropic obtained exclusive marketing rights to HTK from
              the developers at Cambridge.
Cost:         On request.
Contact:      Entropic Research Laboratory, Washington Research
              Laboratory, 600 Pennsylvania Ave, S.E. Suite 202,
              Washington, D.C. 20003
              (202) 547-1420. email - info@wrl.epi.com


Package Name: DragonDictate
Platform:     PC
Description:  Speaker dependent/adaptive system requiring words to be
              separated by short pauses. Vocabulary of 25,000 words
              including a "custom" word set.
Rough Cost:   ?
Requirements: 386/486 with plenty of memory
Contact:      Dragon Systems Inc.
              90 Bridge Street, Newton MA 02158
              Tel: 1-617-965-5200, Fax: 1-617-527-0372


Product name: IN3 Voice Command For Windows
Platform:     PC with Windows 3.1
Description:  Speech Recognition system simplifies the Windows interface
              by letting users call applications to the foreground with
              voice commands. Once the application is called, the user
              may enter commands and data with voice commands. IN3 (IN
              CUBE) is easily customized for any Windows application.
              IN3 is hardware-independent, letting users with any
              Windows-compatible audio add speech recognition to the
              desktop. IN3 is based on continuous word-spotting
              technology.
Price:        $179 U.S.
Requirements: PC with 80386 processor or higher, Microsoft Windows 3.1.
Misc:         Fully functional demo is available on Compuserve in
              Multimedia Forum #6 (filename in3dem.zip).
Contact:      Brantley Kelly
              Email: cbk@gacc.atl.ga.us   CIS: 75120,431
              FAX: 1-404-925-7924   Phone: 1-404-925-7950
              Command Corp.
              Inc, 3675 Crestwood Parkway, Duluth, GA 30136, USA


Package Name: SayIt
Platform:     Sun SPARCstation
Description:  Voice recognition and macro building package for Suns in
              the Openwindows 3.0 environment. Speaker dependent
              discrete speech recognition. Vocabularies can be assoc
              [text missing]


[text missing] ...ished
    research papers. The components are:
    1. A preprocessor which implements many standard and many
       non-standard front end processing techniques.
    2. A recurrent net recogniser and parameter files
    3. Two Markov model based recognisers, one for phone recognition
       and one for word recognition
    4. A dynamic programming scoring package
    The complete system performs competitively.
Cost: Free
Requirements: TIMIT and Resource Management databases
Contact: ajr@eng.cam.ac.uk (Tony Robinson)
Availability: by FTP from "svr-ftp.eng.cam.ac.uk" as /misc/recnet-1.0.tar


Package Name: Voice Command Line Interface
Platform: Amiga
Description: VCLI will execute CLI commands, ARexx commands, or ARexx
    scripts by voice command through your audio digitizer. VCLI allows
    you to launch multiple applications or control any program with an
    ARexx capability entirely by spoken voice command.
    VCLI is fully
    multitasking and will run in the background, continuously listening
    for your voice commands even while other programs are running.
    Documentation is provided in AmigaGuide format.
    VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0.
Cost: Free?
Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic,
    and Generic audio digitizers.
Availability: by ftp from wuarchive.wustl.edu in the file
    systems/amiga/incoming/audio/VCLI60.lha and from
    amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
Contact: Author's email is RHorne@cup.portal.com


Package Name: xxx
Platform: PC, Mac, UNIX, Amiga ....
Description: (e.g. isolated word, speaker independent...)
Rough Cost: (if applicable)
Requirements: (hardware/software needs - if applicable)
Misc:
Contact: (email, ftp or address)


Can anyone provide info on

    Voice Navigator (from Articulate Systems)
    IN3 Voice Command


Can you provide information on any other software/hardware/packages?
Commercial, public domain and research packages are all appropriate.

[There should be enough info for someone to start a separate posting.]


=======================================================================

PART 6 - Natural Language Processing

There is now a newsgroup specifically for Natural Language Processing.
It is called comp.ai.nat-lang.

There is also a lot of useful information on Natural Language Processing
in the FAQ for comp.ai. That FAQ lists available software and useful
references. It includes a substantial list of software, documentation
and other info available by ftp.

------------------------------------------------------------------------

Q6.1: What are some good references/books on NLP?


Take a look at the FAQ for the "comp.ai" newsgroup as it also includes some
useful references.


  James Allen: Natural Language Understanding.
  (Benjamin/Cummings Series in
  Computer Science) Menlo Park: Benjamin/Cummings Publishing Company, 1987.

    This book consists of four parts: syntactic processing, semantic
    interpretation, context and world knowledge, and response generation.

  G. Gazdar and C. Mellish, Natural Language Processing in {Prolog/Lisp/Pop11},
  Addison Wesley, 1989

    Emphasis on parsing, especially unification-based parsing, lots of
    details on the lexicon, feature propagation, etc. Fair coverage of
    semantic interpretation, inference in natural language processing,
    and pragmatics; much less extensive than in Allen's book, but more
    formal. There are three versions, one for each programming language
    listed above, with complete code.

  Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1 and 2.
  New York: John Wiley & Sons, 1990.

    There are articles on the different areas of natural language
    processing which also give additional references.

  Paris, Ce'cile L.; Swartout, William R.; Mann, William C.: Natural Language
  Generation in Artificial Intelligence and Computational Linguistics. Boston:
  Kluwer Academic Publishers, 1991.

    The book describes the most current research developments in natural
    language generation and all aspects of the generation process are
    discussed. The book is comprised of three sections: one on text
    planning, one on lexical choice, and one on grammar.

  Readings in Natural Language Processing, ed by B. Grosz, K. Sparck Jones
  and B. Webber, Morgan Kaufmann, 1986

    A collection of classic papers on Natural Language Processing.
    Fairly complete at the time the book came out (1986) but now
    seriously out of date. Still useful for ATN's, etc.

  Klaus K.
Obermeier, Natural Language Processing Technologiesã in Artificial Intelligence: The Science and Industry Perspective,ã Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989.ãããThe major journals of the field are "Computational Linguistics" and ã"Cognitive Science" for the artificial intelligence aspects, "Cognition" ãfor the psychological aspects, "Language", "Linguistics and Philosophy" and ã"Linguistic Inquiry" for the linguistic aspects. "Artificial Intelligence" ãoccasionally has papers on natural language processing.ãããThe major conferences are ACL (held every year) and COLING (held every twoãyears). Most AI conferences have a NLP track; AAAI, ECAI, IJCAI and theãCognitive Science Society conferences usually are the most interesting for ãNLP. CUNY is an important psycholinguistic conference. There are lots of ãlinguistic conferences: the most important seem to be NELS, the conference ãof the Chicago Linguistic Society (CLS), WCCFL, LSA, the Amsterdam Colloquium,ãand SALT. ããã------------------------------------------------------------------------ããQ6.2: What NLP software is available?ããThe FAQ for the "comp.ai" newsgroup lists a variety of language processing ãsoftware that is available. That FAQ is posted monthly.ããNatural Language Software RegistryããThe Natural Language Software Registry is available from the German Research ãInstitute for Artificial Intelligence (DFKI) in Saarbrucken.ããThe current version details ã + speech signal processors, e.g. Computerized Speech Lab (Kay Electronics)ã + morphological analyzers, e.g. PC-KIMMO (Summer Institute for Linguistics)ã + parsers, e.g. Alveytools (University of Edinburgh)ã + knowledge representation systems, e.g. 
Rhet (University of Rochester)ã + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS),ã SNePS (SUNY Buffalo),ã + applications programs (misc.)ããThis document is available on-line via anonymous ftp to ã Site: ftp.dfki.uni-sb.de ã Directory: /registry ãor by email to registry@dfki.uni-sb.de.ããIf you have developed a piece of software for natural language processing ãthat other researchers might find useful, you can include it by returning ãa description form, available from the same source.ããContacts: Christoph Jung, Markus Vonerden ã Natural Language Software Registryã Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)ã Stuhlsatzenhausweg 3ã D-W-6600 Saarbrueckenã Germanyãã phone: +49 (681) 303-5282ã e-mail: registry@dfki.uni-sb.deããã ã ãAndrew HuntãSpeech Technology Research Group Ph: 61-2-692 4509ãDept. of Electrical Engineering Fax: 61-2-692 3847ãUniversity of Sydney, NSW, 2006, Australia email: andrewh@ee.su.oz.auã